Penguins Dataset

Data Exploration

The penguin dataset comprises 344 observations of 8 variables, five of which are numeric and three qualitative. The dataset is mostly complete, with just a few observations containing missing values that will need to be handled.
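
The summary below is the kind of output produced by the skimr package; a minimal sketch, assuming the data come from the palmerpenguins package and the data frame is named `penguins` as in the summary header:

```r
library(palmerpenguins)  # provides the penguins data frame (assumption)
library(skimr)

# Summarize every variable, grouped by column type (factor vs numeric)
skim(penguins)
```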

Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

The target variable of interest is the species of penguins, which are categorized into three groups: Adelie, Gentoo and Chinstrap penguins.

## [1] Adelie    Gentoo    Chinstrap
## Levels: Adelie Chinstrap Gentoo

Species Distribution on Islands

From this plot, we can make a few key observations:

  • Gentoo penguins are only found on Biscoe Island
  • Chinstrap penguins are found only on Dream Island
  • Adelie penguins are found on all three islands
  • Torgersen Island only has Adelie penguins

These island observations provide valuable information for differentiating penguin species.

Sex Distribution

However, the sex of the penguins does not offer much information as the proportion is about even across all species. We can also note a few missing observations labeled as NA.

Missing Values & Variable Selection

We noted from the data summary above that 11 observations were missing the sex variable. There is also no reason to believe that the year an observation was taken would have any impact on the morphology of the penguins, and we are not attempting any time-series modeling, so we drop year from our predictor variables. Finally, two observations are missing all body measurements, so these rows will be dropped.
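
A minimal sketch of these cleaning steps, assuming dplyr/tidyr are available and the cleaned data frame is named `penguins_clean` (an assumed name):

```r
library(dplyr)
library(tidyr)

penguins_clean <- penguins %>%
  select(-year) %>%  # year has no bearing on morphology
  drop_na()          # drops the 11 rows with any missing value

nrow(penguins_clean)  # 333 rows remain (344 - 11)
```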

Body Measurements

When looking at body measurements, we see that Adelie and Chinstrap penguins largely overlap except in bill_length. This suggests that we might be able to use bill_depth, body_mass and flipper_length to differentiate the Gentoo penguins from the other two species, while bill_length sets the Adelie penguin apart from the others.

The scatterplot matrix below is another way to visualize the separation and overlap between classes for different combinations of variables. In general, Gentoo penguins stand alone as a separate group, while Adelie and Chinstrap penguins overlap in the comparisons of bill_depth, flipper_length and body_mass.

We see on the univariate feature plots below that the data is approximately normally distributed.

Multicollinearity

Taking a look at the correlation matrix below, we can make a few observations, notably that flipper_length is highly positively correlated with body_mass which makes sense given that larger penguins should have larger flippers. The other correlations are less obvious to interpret. Given that the dataset only contains a few predictors, we choose not to exclude any variables based on multicollinearity at this time.

K-Nearest Neighbors

The KNN algorithm requires minor data processing. First, predictor variables that are factors should be converted to numeric. Second, because KNN classifies points based on the distance between them, it is important for the variables to be scaled appropriately; here we pass c("center", "scale") to the preProcess argument of the training function to standardize each variable. The data is then split 80%/20% into training and testing sets: the training set contains 268 observations and the test set 65.
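
The split and standardization described above might look like the following caret sketch; the seed, object names and tuning grid are assumptions:

```r
library(caret)
set.seed(42)  # hypothetical seed for reproducibility

# 80/20 split, stratified on the target class
idx       <- createDataPartition(penguins_clean$species, p = 0.8, list = FALSE)
train_set <- penguins_clean[idx, ]
test_set  <- penguins_clean[-idx, ]

# Standardize each predictor and tune k with 10-fold cross-validation
knn_fit <- train(
  species ~ ., data = train_set,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = trainControl(method = "cv", number = 10),
  tuneGrid   = data.frame(k = 1:10)
)
```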

Modeling

We performed 10-fold cross-validation on the training data to determine the optimal parameter k for our model. The resulting accuracy for each value of k is displayed and plotted below. The maximum accuracy is reached at k=3 and k=4, and the training procedure automatically chose k=4 as the best model. The tuned model gains nearly a full percentage point of cross-validation accuracy over models with slightly more or fewer neighbors.

## k-Nearest Neighbors 
## 
## 268 samples
##   5 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## Pre-processing: centered (5), scaled (5) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 241, 241, 240, 242, 241, 242, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9851852  0.9769336
##    2  0.9850427  0.9765099
##    3  0.9962963  0.9941558
##    4  0.9962963  0.9941558
##    5  0.9887464  0.9821064
##    6  0.9851750  0.9764839
##    7  0.9851750  0.9764839
##    8  0.9888787  0.9823281
##    9  0.9888787  0.9823281
##   10  0.9888787  0.9823281
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 4.

Results

The evaluation of the tuned K-NN model on the testing data reveals that the model was able to classify species with perfect accuracy. However, it is important to note that 100% prediction accuracy is rare in practice; this model benefited from the fairly clean class separations and limited overlap in the original dataset.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        29         0      0
##   Chinstrap      0        13      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9448, 1)
##     No Information Rate : 0.4462     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000              1.0        1.0000
## Specificity                 1.0000              1.0        1.0000
## Pos Pred Value              1.0000              1.0        1.0000
## Neg Pred Value              1.0000              1.0        1.0000
## Prevalence                  0.4462              0.2        0.3538
## Detection Rate              0.4462              0.2        0.3538
## Detection Prevalence        0.4462              0.2        0.3538
## Balanced Accuracy           1.0000              1.0        1.0000

Loan Approval Dataset

A loan-issuance company with a presence across urban, semi-urban and rural areas wants to validate customers' eligibility for a loan. Using an online application form, customers enter a series of attributes such as Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the eligibility process, multiple supervised classification approaches can be used. In our case, we explore tree-based methods starting from simple Decision Trees (DT). Ensemble methods such as Random Forests (RF) and Gradient Boosting (GB) are also used to improve on the classification accuracy of the simple DT.

The loan dataset is composed of 13 variables and 614 observations. The target or dependent variable is Loan_Status and contains ‘Y’ or ‘N’ as entries. Predicting whether a loan will be approved is a binary classification problem.

Eight of the variables are factors and five are numeric. The dataset contains missing values recorded either as ‘NA’ or simply left empty; some columns are missing nearly 10% of observations, so imputing the missing values will be a step in the data pre-processing. We also note that the Loan_ID variable is simply an index and holds no useful information, making it safe to remove. The Credit_History variable is coded as numeric, but it is really a binary variable that should be treated as a factor with two levels.

Data summary
Name loan
Number of rows 614
Number of columns 13
_______________________
Column type frequency:
factor 8
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Loan_ID 0 1.00 FALSE 614 LP0: 1, LP0: 1, LP0: 1, LP0: 1
Gender 13 0.98 FALSE 2 Mal: 489, Fem: 112, emp: 0
Married 3 1.00 FALSE 2 Yes: 398, No: 213, emp: 0
Dependents 15 0.98 FALSE 4 0: 345, 1: 102, 2: 101, 3+: 51
Education 0 1.00 FALSE 2 Gra: 480, Not: 134
Self_Employed 32 0.95 FALSE 2 No: 500, Yes: 82, emp: 0
Property_Area 0 1.00 FALSE 3 Sem: 233, Urb: 202, Rur: 179
Loan_Status 0 1.00 FALSE 2 Y: 422, N: 192

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ApplicantIncome 0 1.00 5403.46 6109.04 150 2877.5 3812.5 5795.00 81000 ▇▁▁▁▁
CoapplicantIncome 0 1.00 1621.25 2926.25 0 0.0 1188.5 2297.25 41667 ▇▁▁▁▁
LoanAmount 22 0.96 146.41 85.59 9 100.0 128.0 168.00 700 ▇▃▁▁▁
Loan_Amount_Term 14 0.98 342.00 65.12 12 360.0 360.0 360.00 480 ▁▁▁▇▁
Credit_History 50 0.92 0.84 0.36 0 1.0 1.0 1.00 1 ▂▁▁▁▇

Data Pre-processing

The pre-processing steps are the following:

  • Creating a new variable called TotalIncome by summing applicant and coapplicant incomes. Typically loan issuers take into account the combined income of the applicant and guarantor.
  • Dropping the valueless variable Loan_ID and the individual income variables that were just combined.
  • Treating Credit_History as a factor with two levels instead of a numeric variable
  • Imputing missing values

Note that the tree-based methods employed in this exercise do not require factors to be coded as numeric or expanded into dummy variables. We can see from the data summary above that the remaining variables have the proper data types.

Imputation of Missing Values

Looking at the pattern plot of missing values, we see a slight but discernible pattern from variable to variable and observation to observation. We therefore assume the values are missing at random (MAR): the probability that a value is missing depends on observed values of other predictor variables, not on the missing value itself.

A simple way of dealing with missing values is “complete case analysis”, in which any observation with a missing value for any predictor is dropped. However, given the small number of observations in this dataset (614), we will instead impute the missing values using various predictive models.

## 
##  Variables sorted by number of missings: 
##          Variable       Count
##    Credit_History 0.081433225
##     Self_Employed 0.052117264
##        LoanAmount 0.035830619
##        Dependents 0.024429967
##  Loan_Amount_Term 0.022801303
##            Gender 0.021172638
##           Married 0.004885993
##         Education 0.000000000
##     Property_Area 0.000000000
##       Loan_Status 0.000000000
##       TotalIncome 0.000000000

Given the assumption that this dataset has Missing At Random Values, we can use the MICE (Multivariate Imputation via Chained Equations) package to impute the missing values throughout the dataset. For each type of predictor variable we will use:

  • Numeric Variables: Predictive Mean Matching (pmm)
  • Binary Variables: Logistic Regression (logreg)
  • Factor Variables (3+ levels): Bayesian Polytomous Regression (polyreg)
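
A sketch of the corresponding mice call; mice assigns these methods automatically from each column's type, and the seed and object names are assumptions:

```r
library(mice)

# Defaults per column type: pmm (numeric), logreg (binary factor),
# polyreg (factor with 3+ levels); complete columns get ""
imp <- mice(loan, m = 5, maxit = 5, seed = 42)
loan_complete <- complete(imp)

table(complete.cases(loan_complete))  # expect all 614 rows complete
```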
## 
##  iter imp variable
##   1   1  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   1   2  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   1   3  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term*  Credit_History
##   1   4  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History
##   1   5  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term  Credit_History
##   2   1  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term*  Credit_History
##   2   2  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   2   3  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term*  Credit_History
##   2   4  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   2   5  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term*  Credit_History
##   3   1  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History
##   3   2  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term  Credit_History
##   3   3  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   3   4  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term  Credit_History
##   3   5  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   4   1  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   4   2  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   4   3  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   4   4  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term  Credit_History
##   4   5  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term*  Credit_History
##   5   1  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   5   2  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term  Credit_History
##   5   3  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term*  Credit_History
##   5   4  Gender  Married  Dependents  Self_Employed  LoanAmount*  Loan_Amount_Term*  Credit_History
##   5   5  Gender  Married  Dependents  Self_Employed  LoanAmount  Loan_Amount_Term*  Credit_History
## 
## TRUE 
##  614

## 
##  Variables sorted by number of missings: 
##          Variable Count
##            Gender     0
##           Married     0
##        Dependents     0
##         Education     0
##     Self_Employed     0
##        LoanAmount     0
##  Loan_Amount_Term     0
##    Credit_History     0
##     Property_Area     0
##       Loan_Status     0
##       TotalIncome     0
##           Gender          Married       Dependents        Education 
##         "logreg"         "logreg"        "polyreg"               "" 
##    Self_Employed       LoanAmount Loan_Amount_Term   Credit_History 
##         "logreg"            "pmm"            "pmm"         "logreg" 
##    Property_Area      Loan_Status      TotalIncome 
##               ""               ""               ""

Data Exploration

By examining the target variable, we see that nearly 70% of all loans are approved.

We can make a few observations from boxplots of the numeric variables below:

  • Loan_Amount_Term only takes a few discrete values representing various durations. The most common value by far, as indicated by the flat box, is 360, meaning that the most common loan term is 360 months (30 years). This is fairly consistent across both outcomes of the dependent variable.
  • LoanAmount does not differ greatly across the dependent variable. The interquartile range is slightly more compressed in the ‘Y’ category, and there is a greater range of outlier values on the upper end. The mean loan amounts are comparable for both outcomes.
  • TotalIncome is fairly similar across the outcomes up to about $30,000 in total income. Interestingly, the observation with the highest total income recorded is the greatest outlier and was not issued a loan. Large skew is observed across both categories.

From the bar plots of the categorical variables shown below, we observe the following:

  • Credit_History: The majority of applicants had credit history, and the large majority of those were approved for a loan. This is likely one of the most important factors in determining who gets a loan.
  • Dependents: Individuals with 0 dependents form the majority of the cohort. Individuals with fewer dependents may be less risk-averse and more willing to take on debt.
  • Education: More individuals with graduate education applied for loans, and a greater proportion of them received one compared to their “Not Graduate” counterparts.
  • Gender: More than three times as many males as females applied for loans. Both genders appear to be granted loans in roughly the same proportion.
  • Married: Married individuals applied for loans in greater numbers. This could reflect the need to finance something like a home or a car, which is more typical of married households.
  • Property_Area: Individuals living in semi-urban property areas submitted the most loan applications and also had the greatest proportion of approved loans. Urban areas follow with approximately 50% of loans approved, while rural areas have fewer applicants and a greater proportion of rejections.
  • Self_Employed: Individuals who were not self-employed made up the large majority of observations. This makes sense given that, in general, salaried employees greatly outnumber the self-employed. Additionally, a self-employed individual may have a less consistent stream of revenue and therefore might be less willing to take on debt.

Interestingly, there is virtually no difference in the relationship between loan amount and total income across the credit-history categories, as seen by the nearly collinear regression lines. We can also observe that individuals on the low end of the total income axis but below the regression line generally had credit history; these are individuals requesting less sizable loans relative to their incomes.

The least squares lines across the dependent variable are also nearly collinear but the line representing individuals who received loans has a slightly greater slope suggesting that higher incomes unlock larger loans which is sensible and expected.

Other than credit history, total income (applicant income + coapplicant income) seems like the most logical basis for approving or denying a loan. The histograms below compare total income across all levels of the categorical variables. Total income is skewed to the right tail, with a few observations at the higher end of the log-transformed income scale.


Data Processing

The data pre-processing section above already addressed the structural form of the data, and no additional processing is required for tree-based methods. We partition the data into training and testing sets in a 70%/30% proportion.

Decision Trees

A Decision Tree is a type of supervised learning model in which the data is recursively split into two or more sub-populations according to some criterion. Each split is headed by a node: the uppermost node is called the root node, nodes that produce no further splits are called terminal (leaf) nodes, and all other nodes are internal nodes. At each step, the population is split into sub-populations such that each split separates the classes better than the previous one. This splitting continues recursively on different predictor variables until the leaf nodes can make accurate predictions for a given observation.

The Decision Tree model can be used for regression or classification. In this case we use the model to classify loan approvals.

Simple Tree

The first step is to grow a baseline decision tree using splits that generate the most information gain, i.e. the largest decrease in entropy after the dataset is split on a given attribute.

After growing a baseline tree we will need to decide whether it requires pruning, so that we can remove leaf nodes from which we cannot draw accurate conclusions. To grow the full tree first, the cost complexity parameter is set to 0.

The cost complexity parameter is slightly different from information gain and the Gini index, but conceptually it is similar: it represents the amount by which splitting a given node must improve the relative error, in other words, the minimum improvement in the overall model needed at each node to justify the split.
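
A sketch of growing the baseline tree with rpart, mirroring the call echoed in the printcp output later in this section:

```r
library(rpart)

base_tree <- rpart(
  Loan_Status ~ ., data = training,
  method  = "class",
  parms   = list(split = "information"),  # information-gain splits
  control = rpart.control(cp = 0)         # no complexity penalty: full tree
)
print(base_tree)
```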

## n= 431 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 431 135 Y (0.31322506 0.68677494)  
##     2) Credit_History=0 73   6 N (0.91780822 0.08219178) *
##     3) Credit_History=1 358  68 Y (0.18994413 0.81005587)  
##       6) TotalIncome< 2457 8   2 N (0.75000000 0.25000000) *
##       7) TotalIncome>=2457 350  62 Y (0.17714286 0.82285714)  
##        14) TotalIncome>=20199.5 11   5 N (0.54545455 0.45454545) *
##        15) TotalIncome< 20199.5 339  56 Y (0.16519174 0.83480826)  
##          30) Loan_Amount_Term>=420 10   5 N (0.50000000 0.50000000) *
##          31) Loan_Amount_Term< 420 329  51 Y (0.15501520 0.84498480)  
##            62) Property_Area=Rural 92  21 Y (0.22826087 0.77173913)  
##             124) TotalIncome< 3606.5 9   4 N (0.55555556 0.44444444) *
##             125) TotalIncome>=3606.5 83  16 Y (0.19277108 0.80722892)  
##               250) Dependents=1 7   3 N (0.57142857 0.42857143) *
##               251) Dependents=0,2,3+ 76  12 Y (0.15789474 0.84210526) *
##            63) Property_Area=Semiurban,Urban 237  30 Y (0.12658228 0.87341772)  
##             126) TotalIncome>=3416 215  30 Y (0.13953488 0.86046512)  
##               252) TotalIncome>=6577 77  15 Y (0.19480519 0.80519481)  
##                 504) TotalIncome< 6800 7   2 N (0.71428571 0.28571429) *
##                 505) TotalIncome>=6800 70  10 Y (0.14285714 0.85714286) *
##               253) TotalIncome< 6577 138  15 Y (0.10869565 0.89130435) *
##             127) TotalIncome< 3416 22   0 Y (0.00000000 1.00000000) *

Given the baseline tree, we see that it is 8 levels deep. Each line of the printout shows the split, the number of observations in the node (n), the misclassification loss, the predicted class (yval) and the class probabilities (yprob).

Looking at the splits, we see that after rejecting applicants lacking credit history, an applicant has an 81% chance of being approved regardless of all other variables. Factoring in income (TotalIncome), an important feature of any loan application, only increases the approval percentage by one point to 82%. Loan_Amount_Term and then Property_Area likewise each add only a few points.

We predict Loan_Status on the test set and calculate the accuracy:
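
A sketch of scoring the baseline tree on the held-out set; the `testing` object name is an assumption:

```r
# Class predictions on the test set
pred <- predict(base_tree, newdata = testing, type = "class")

# Confusion matrix and accuracy
conf <- table(Actual = testing$Loan_Status, Predicted = pred)
conf
cat("Accuracy: ", sum(diag(conf)) / sum(conf), "\n")
```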

##       Predicted
## Actual   N   Y
##      N  29  28
##      Y  21 105
## Accuracy:  0.7322404

Pruned Tree

We now try to prune the tree based on cross-validated error rates. In other words, we look for the cost complexity parameter with the lowest cross-validation error rate, then supply this value back into growing the decision tree, stopping growth past the point where the cross-validation error rate is no longer minimal.

From the output below we see that the error rate is minimized when the tree has only 2 levels, i.e. a single split on Credit_History. This indicates the Decision Tree provides its most accurate predictions at 2 levels, but it is an unrealistically simple tree; for practical reasons, we instead choose the cost complexity parameter whose pruned tree retains at least 2 splits.

## 
## Classification tree:
## rpart(formula = Loan_Status ~ ., data = training, method = "class", 
##     parms = list(split = "information"), control = rpart.control(cp = 0))
## 
## Variables actually used in tree construction:
## [1] Credit_History   Dependents       Loan_Amount_Term Property_Area   
## [5] TotalIncome     
## 
## Root node error: 135/431 = 0.31323
## 
## n= 431 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.4518519      0   1.00000 1.00000 0.071325
## 2 0.0296296      1   0.54815 0.54815 0.057993
## 3 0.0074074      2   0.51852 0.57037 0.058908
## 4 0.0052910      3   0.51111 0.60741 0.060360
## 5 0.0000000     10   0.47407 0.76296 0.065582
## Minimum Error:  0.007407407

We prune the tree using the CP value with the lowest cross-validation error rate among trees with at least 2 splits. The resulting tree is three levels deep, and the most significant predictors are Credit_History and TotalIncome.
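
A sketch of the CP selection and pruning; the selection rule (lowest xerror among trees with at least 2 splits) follows the text, and the object names are assumptions:

```r
# CP table from the full tree grown with cp = 0
cp_tab <- base_tree$cptable

# Candidates with at least 2 splits; pick the lowest cross-validated error
cand    <- cp_tab[cp_tab[, "nsplit"] >= 2, , drop = FALSE]
cp_best <- cand[which.min(cand[, "xerror"]), "CP"]

pruned_tree <- prune(base_tree, cp = cp_best)
```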

Accuracy of pruned tree

##       Predicted
## Actual   N   Y
##      N  25  32
##      Y   7 119
## Accuracy:  0.7868852

Random Forest

The Random Forest model constructs trees in the same way as the Decision Tree model; the difference is that a Random Forest is a collection (ensemble) of decision trees. Each tree in the forest is constructed from bootstrapped observations and a randomized subset of features. Every tree then classifies a given observation, and the majority vote across trees determines the predicted class.

To build an optimal forest there are 2 parameters that we can tune:

  • ntree: the number of trees in the forest
  • mtry: the number of variables randomly sampled as candidates at each split

For classification problems, the generally recommended value of mtry is the square root of the total number of predictor variables. With 10 predictors in our dataset, this gives √10 ≈ 3.

From the results below we see that 2 random variables per split generates the lowest out-of-bag error. In brief: when using bootstrapped samples, about a third of the observations are not used in constructing a given tree, and the error of the model’s predictions on these held-out observations is the out-of-bag (OOB) error.

We therefore set the mtry parameter to the value that generates the lowest OOB error rate.
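
A sketch of the mtry search and final fit; tuneRF's step and improvement arguments and the seed are assumptions, but the randomForest call mirrors the one echoed in the output below:

```r
library(randomForest)
set.seed(42)  # hypothetical seed

x <- training[, setdiff(names(training), "Loan_Status")]
y <- training$Loan_Status

# Search mtry around sqrt(p), keeping the value with the lowest OOB error
tuned   <- tuneRF(x, y, ntreeTry = 500, stepFactor = 2, improve = 0.05)
val_opt <- tuned[which.min(tuned[, "OOBError"]), "mtry"]

rf_fit <- randomForest(Loan_Status ~ ., data = training,
                       importance = TRUE, ntree = 500, mtry = val_opt)
```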

## mtry = 3  OOB error = 17.4% 
## Searching left ...
## mtry = 2     OOB error = 17.4% 
## 0 0.05 
## Searching right ...
## mtry = 6     OOB error = 18.56% 
## -0.06666667 0.05

## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = training, importance = TRUE,      ntree = 500, mtry = val_opt) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.71%
## Confusion matrix:
##    N   Y class.error
## N 72  63  0.46666667
## Y  9 287  0.03040541

Next we check whether we have grown enough trees to minimize the OOB error rate. From the plot below we see that the OOB error rate decreases over the first 1-20 trees and stabilizes at around 100 trees, so for this dataset 500 trees is more than enough to minimize the OOB error rate.

As with Decision Trees, an area of interest is variable importance, a measure of a given variable’s predictive power, computed by averaging the purity gain its splits produce in child nodes across all trees in the forest. Variables whose splits yield large increases in purity are considered more “important”.

Below is a table of the predictor variables and their importance measures. Some definitions to help interpret the table:

  • Accuracy Decrease: the decrease in overall model accuracy if the given predictor is excluded from the model
  • Gini Decrease: each split on a given predictor reduces the Gini index; this measure is the average of that reduction across the entire forest
  • Mean_min_depth: the average depth across trees at which the variable first appears in a split
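
The importance table and the multi-way importance plot below are the kind produced by the randomForestExplainer package; a sketch, assuming the fitted forest is named `rf_fit`:

```r
library(randomForestExplainer)

# One row per predictor: mean_min_depth, accuracy_decrease,
# gini_decrease, times_a_root, and related measures
imp_frame <- measure_importance(rf_fit)
imp_frame

# Multi-way importance plot: accuracy decrease vs Gini decrease
plot_multi_way_importance(imp_frame,
                          x_measure = "accuracy_decrease",
                          y_measure = "gini_decrease")
```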

Predictors from the bootstrapped sample set that yield the highest purity are selected first. Credit_History looks to be the most important predictor because it:

  • usually appears at or near the root of most sampled trees
  • has the lowest mean minimal depth
  • is most often chosen as the root split

Looking at the multi-way importance plot, we see that Credit_History:

  • has the largest accuracy decrease when omitted from the model
  • has the highest Gini decrease of all variables

Random Forest Accuracy

##       Predicted
## Actual   N   Y
##      N  24  33
##      Y   4 122
## Accuracy:  0.7978142

Gradient Boosting

This is another kind of ensemble algorithm in which several models are combined, but instead of running the models in parallel and then combining results, the models run in series, each output serving as an input to the next. With each new run of the classifier, the results improve. Each individual classifier’s performance can be rather low (shallow trees can be used, for example), yet by feeding the results into a new classifier, the overall effect is improved performance.
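
A sketch of the boosted model with gbm; the hyperparameters (tree depth, shrinkage, folds) and seed are assumptions, though the multinomial distribution matches the warning shown below:

```r
library(gbm)
set.seed(42)  # hypothetical seed

gbm_fit <- gbm(
  Loan_Status ~ ., data = training,
  distribution = "multinomial",  # emits the compatibility warning shown below
  n.trees = 500, interaction.depth = 3,
  shrinkage = 0.01, cv.folds = 5
)

# Cross-validated optimal number of trees
best_iter <- gbm.perf(gbm_fit, method = "cv")

# Relative influence of each predictor at the optimal iteration
summary(gbm_fit, n.trees = best_iter)
```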

We run the gradient boosting model:

## Warning: Setting `distribution = "multinomial"` is ill-advised as it is
## currently broken. It exists only for backwards compatibility. Use at your own
## risk.

We check the performance of the model and the optimal number of trees under the “cv” method. The optimal number of trees is 116; we will run our predictions with that number.

## [1] 116

From the model we see that Credit_History, TotalIncome and LoanAmount are the most important variables, and Gender is the least important (which is desirable).

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  23  34
##          Y   4 122
##                                           
##                Accuracy : 0.7923          
##                  95% CI : (0.7263, 0.8487)
##     No Information Rate : 0.8525          
##     P-Value [Acc > NIR] : 0.9893          
##                                           
##                   Kappa : 0.4344          
##                                           
##  Mcnemar's Test P-Value : 2.546e-06       
##                                           
##             Sensitivity : 0.8519          
##             Specificity : 0.7821          
##          Pos Pred Value : 0.4035          
##          Neg Pred Value : 0.9683          
##              Prevalence : 0.1475          
##          Detection Rate : 0.1257          
##    Detection Prevalence : 0.3115          
##       Balanced Accuracy : 0.8170          
##                                           
##        'Positive' Class : N               
## 
##       Predicted
## Actual   N   Y
##      N  23  34
##      Y   4 122
## Accuracy:  0.7923497

So with gradient boosting we obtained approximately 79.23% accuracy.

Model Performance